Modeling spontaneous speech variability for large vocabulary continuous speech recognition

Author

  • Hauke Schramm
Abstract

In this work, a number of novel techniques for improved treatment of spontaneous speech variabilities in large vocabulary automatic speech recognition are developed and evaluated on US English conversational speech and spontaneous medical dictations. Two main aspects of spontaneous speech modeling are addressed: the general handling of pronunciation variability, and the individual and parallel treatment of multiple speech variabilities in the acoustic and pronunciation model of a one-pass speech recognizer. The first part of the thesis addresses the problem of optimally incorporating multiple alternative pronunciations into the search framework. This includes the question of how to efficiently combine the probabilistic contributions of alternative pronunciations in the course of a left-to-right search procedure. The well-known maximum approximation, usually applied in this context, is compared to a novel time-synchronous sum approximation technique which integrates alternative pronunciations in a weighted sum of acoustic probabilities. It is shown on a conversational speech task that this approach outperforms the maximum approximation by 2% relative and reduces the search costs by 7%. Another important issue with respect to the incorporation of alternative pronunciations into the search framework is the statistical weighting of the pronunciations. The commonly used pronunciation unigram prior probabilities are typically estimated by the relative frequencies of pronunciations in the training hypotheses. This standard maximum likelihood solution is compared to a novel discriminative training scheme which is an extension of the Discriminative Model Combination technique proposed in [Beyerlein 01].
The developed iterative reestimation procedure is shown to adjust the influence of a specific pronunciation prior probability in the discriminant function depending on (1) the word error rate, (2) the frequency of occurrence of this pronunciation in the correct hypothesis and its rivals, and (3) the underlying acoustic, pronunciation and language model. An evaluation of this technique on a conversational speech task showed a 6.5% relative improvement on the training corpus and a 2% relative gain on an independent test set. The second major part of this thesis addresses the development and evaluation of a novel training and search framework which enables a specific, parallel treatment of multiple speech variabilities in the acoustic and pronunciation model. This technique (1) classifies portions of speech (e.g. words) with respect to given variability classes (e.g. rate of speech), (2) builds class-specific acoustic and pronunciation models, and (3) combines these models later in the search procedure at the word level. A theoretical framework for an efficient integration of the class-specific acoustic and pronunciation models into a one-pass search procedure is developed which incorporates contributions from class-specific alternatives in a weighted sum of acoustic probabilities. This multi-variability framework applies a very general model combination technique which may be used to combine arbitrary acoustic and pronunciation models at the word level. In this work, it is used in particular for a parallel, explicit treatment of three important spontaneous speech variabilities: pronunciation variability, rate of speech variability, and filled pause variability. The best multi-variability system combines 6 class-specific acoustic and pronunciation models at the word level and achieves a word error rate reduction of 13% relative on a highly spontaneous medical dictation task and a gain of 9% relative on conversational speech.
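
As a rough illustration of two ideas named in the abstract, the following Python sketch estimates pronunciation unigram priors by relative frequency and then scores a word either with the maximum approximation or with a weighted sum over its alternative pronunciations. It is a toy, word-level sketch and not the thesis's method: the time-synchronous integration into a one-pass search is not modeled, and the function names (unigram_priors, word_score), the example word, its pronunciations and the log-probabilities are all hypothetical.

```python
import math
from collections import Counter


def unigram_priors(observed):
    """Relative-frequency (maximum likelihood) estimate of p(pron | word).

    observed: list of (word, pronunciation) pairs, e.g. taken from
    aligned training data (hypothetical input format).
    """
    pair_counts = Counter(observed)
    word_counts = Counter(word for word, _ in observed)
    return {(w, v): c / word_counts[w] for (w, v), c in pair_counts.items()}


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def word_score(word, acoustic_logprob, priors, mode="sum"):
    """Combine the alternative pronunciations of one word.

    acoustic_logprob: dict mapping pronunciation -> log p(x | pronunciation)
    mode="max": keep only the best weighted pronunciation.
    mode="sum": weighted sum of acoustic probabilities over all variants.
    """
    terms = [math.log(priors[(word, v)]) + lp
             for v, lp in acoustic_logprob.items()]
    return max(terms) if mode == "max" else log_sum_exp(terms)


# Hypothetical training observations and acoustic scores.
training = [("and", "ae n d"), ("and", "ae n"),
            ("and", "ae n d"), ("and", "ax n")]
priors = unigram_priors(training)
scores = {"ae n d": -10.0, "ae n": -10.5, "ax n": -12.0}

s_max = word_score("and", scores, priors, mode="max")
s_sum = word_score("and", scores, priors, mode="sum")
```

Because every term in the sum is non-negative in the probability domain, the sum score can never fall below the max score for the same word; how that difference affects recognition accuracy and search cost is an empirical question that the abstract answers only for the author's own system.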


Similar resources

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...


Probabilistic Speaker-Class based Acoustic Modeling for Large Vocabulary Continuous Speech Recognition

In this paper, a probabilistic speaker-class (PSC) based acoustic modeling method is proposed for taking into account speaker variability influence in HMM-based speech recognition systems. Firstly, within the context of speaker-class based speech recognition, an experiment is conducted to investigate the performance of speaker-class recognition based on hard-cut speaker clustering. Then, in the...


Automatic generation of pronunciation lexicons for Mandarin spontaneous speech

Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed for Engl...


Modeling between-word coarticulation in continuous speech recognition

This paper describes the addition of between-word coarticulation modeling into SPHINX, an accurate large-vocabulary speaker-independent continuous speech recognition system. Between-word coarticulation is a major source of phonetic variability in continuous speech. By detailed modeling of between-word triphones and utilizing the generalized triphone technique, we obtain an error rate reduction o...


Modeling Lexical Tones for Mandarin Large Vocabulary Continuous Speech Recognition


Spontaneous Thai speech recognition

This paper expands previous work on Thai speech recognition, investigating pronunciation changes such as syllable and phoneme elisions as well as phoneme shifts in Thai spontaneous speech. We compare several approaches to model these effects in large vocabulary continuous speech recognition across multiple domains. This work includes experiments on two new speech databases that significantly al...




Publication date: 2006